Abstract

In this paper, we introduce a new dataset consisting of 360,001 focused
natural language descriptions for 10,738 images. This dataset, the Visual
Madlibs dataset, is collected using automatically produced
fill-in-the-blank templates designed to gather targeted descriptions about:
people and objects, their appearances, activities, and interactions, as
well as inferences about the general scene or its broader context. We
provide several analyses of the Visual Madlibs dataset and demonstrate its
applicability to two new description generation tasks: focused description
generation, and multiple-choice question-answering for images. Experiments
using joint-embedding and deep learning methods show promising results on
these tasks.


1. Introduction 

Much of everyday language and discourse concerns the visual world around
us, making understanding the relationship between the physical world and
language describing that world an important challenge problem for AI.
Understanding this complex and subtle relationship will have broad
applicability toward inferring human-like understanding for images,
producing natural human-robot interactions, and for tasks like natural
language grounding in NLP. In computer vision, along with improvements in
deep learning based visual recognition, there has been an explosion of
recent interest in methods to automatically generate natural language
descriptions for images [5, 9, 32, 16, 20] or videos. However, most of
these methods and existing datasets have focused on only one type of
description, a generic description for the entire image.

In this paper, we collect a new dataset of focused, targeted descriptions,
the Visual Madlibs dataset, as illustrated in Figure 1. To collect this
dataset, we introduce automatically produced fill-in-the-blank templates
designed to collect a range of different descriptions for visual content in
an image. For example, a user might be presented with an image and a
fill-in-the-blank template such as “The frisbee is [blank]” and asked to
fill in the [blank] with a description of the appearance of the frisbee.
Alternatively, they could be asked to fill in the [blank] with a
description of what the person is doing with the frisbee. Fill-in-the-blank
questions can be targeted to collect descriptions about people and objects,
their appearances, activities, and interactions, as well as descriptions of
the general scene or the broader emotional, spatial, or temporal context of
an image. Using these templates, we collect a large collection of 360,001
targeted descriptions for 10,738 images. Fig. 2 shows some Madlibs
description samples.

1. This place is a park.
2. When I look at this picture, I feel competitive.
3. The most interesting aspect of this picture is the guys playing shirtless.
4. One or two seconds before this picture was taken, the person caught the frisbee.
5. One or two seconds after this picture was taken, the guy will throw the frisbee.
6. Person A is wearing blue shorts.
7. Person A is in front of person B.
8. Person A is blocking person B.
9. Person B is a young man wearing an orange hat.
10. Person B is on a grassy field.
11. Person B is holding a frisbee.
12. The frisbee is white and round.
13. The frisbee is in the hand of the man with the orange cap.
14. People could throw the frisbee.
15. The people are playing with the frisbee.

Figure 1: An example from the Visual Madlibs Dataset. This dataset collects
targeted descriptions for people and objects, denoting their appearances,
affordances, activities, and interactions. It also provides descriptions of
broader emotional, spatial and temporal context for an image.

Row 1 (question types 1-5):

This place is a(n) road.
When I look at this picture, I feel free.
The most interesting aspect of this picture is the motorcycles.
One or two seconds before this picture was taken, they stopped to chat and decided where to go.
One or two seconds after this picture was taken, the bikers ride down the road.

This place is a(n) restaurant.
When I look at this picture, I feel like I want donuts.
The most interesting aspect of this picture is the box of doughnuts.
One or two seconds before this picture was taken, the box was closed.
One or two seconds after this picture was taken, a third person out of frame picks up a doughnut.

This place is a(n) waterway.
When I look at this picture, I feel concerned.
The most interesting aspect of this picture is the men standing on elephants.
One or two seconds before this picture was taken, the people were sitting on the elephant.
One or two seconds after this picture was taken, the men got off the elephants.

Row 2 (question types 9-11):

Person A is a girl.
Person A is learning how to surf.
Person A is kneeling on a surfboard.
Person B is a man in blue shorts.
Person B is walking on a beach.
Person B is at the beach.
Person C is a short haired black girl.
Person C is practicing surfing.
Person C is on a gold surfboard.
Person D is a lady in board shorts.
Person D is standing around.
Person D is next to a blue surfboard.

Person A is at a zoo.
Person B is wearing a grey T-shirt.
Person B is talking to two other people.
Person B is next to an elephant.
Person C is at a zoo.

Person A is a balding male.
Person A is playing a video game.
Person A is downstairs.
Person B is wearing jeans.
Person B is playing wii.
Person B is in a basement.
Person C is wearing dark clothes.
Person C is walking upstairs.
Person C is in a building.
Person D is mostly out of the photo.
Person D is walking away.
Person D is in the basement.

Person A is a young man in green.
Person A is trying to block a frisbee.
Person A is on a field.
Person B is wearing purple.
Person B is throwing a frisbee.
Person B is on a field.

Row 3 (question types 6-8 and 12):

The couches are white.
The couches are in the center of the room.
People could relax on the couches.
The TV is on.
The TV is near the wall.
People could watch the TV.

The car is on a concrete pad.
People could ride in the car.
The umbrellas are open.
The umbrellas are in the people's hands.
People could stay dry under the umbrellas.

The person is putting food in the bowl.

The people are eating cake at the dining table.
The people are serving the cake.

Figure 2: Madlibs description. The first row corresponds to question types
1-5, the second row corresponds to question types 9-11, and the third row
corresponds to question types 6-8 and question type 12. All question types
are listed in Table 1.

With this new dataset, we can develop methods to generate more focused
descriptions. Instead of asking an algorithm to “describe the image” we can
now ask for more focused descriptions such as “describe the person”,
“describe what the person is doing,” or “describe the relationship between
the person and the frisbee.” We can also ask questions about aspects of an
image that are somewhat beyond the scope of the directly depicted content.
For example, “describe what might have happened just before this picture
was taken” or “describe how this image makes you feel.” These types of
descriptions reach toward high-level goals of producing human-like visual
interpretations for images.

In addition to focused description generation, we also introduce a
multiple-choice question-answering task for images. In this task, the
computer is provided with an image and a partial description such as “The
person is [blank]”. A set of possible answers is also provided: one answer
that was written about the image in question, and several additional
answers written about other images. The computer is evaluated on how well
it can select the correct choice. In this way, we can evaluate performance
of description generation on a concrete task, making evaluation more
straightforward. Varying the difficulty of the negative answers (adjusting
how similar they are to the correct answer) provides a nuanced measurement
of performance.

For both the generation and question-answering tasks, we study and evaluate
a recent state-of-the-art approach for image description generation, as
well as a simple joint-embedding method learned on deep representations.
The evaluation also includes extensive analysis of the Visual Madlibs
dataset and comparisons to the existing MS COCO dataset of natural language
descriptions for images.

In summary, our contributions are: 

1) A new description collection strategy, Visual Madlibs, for constructing
fill-in-the-blank templates to collect targeted natural language
descriptions.

2) A new Visual Madlibs Dataset consisting of 360,001 targeted
descriptions, spanning 12 different types of templates, for 10,738 images,
as well as analysis of the dataset and comparisons to existing MS COCO
descriptions.

3) Evaluation of a generation method and a simple joint-embedding method
for targeted description generation.

4) Definition and evaluation of generation and joint-embedding methods on a
new task, multiple-choice fill-in-the-blank question answering for images.

The rest of our paper is organized as follows. First, we review related
work (Sec. 2). Then, we describe our strategy for automatically generating
fill-in-the-blank templates and introduce our Visual Madlibs dataset
(Sec. 3). Next we outline the multiple-choice question answering and
targeted generation tasks (Sec. 4) and provide several analyses of our
dataset (Sec. 5). Finally, we provide experiments evaluating description
generation and joint-embedding methods on the proposed tasks (Sec. 6) and
conclude (Sec. 7).

2. Related work 

Description Generation: Recently, there has been an explosion of interest
in methods for producing natural language descriptions for images or video.
Early work in this area generally explored two complementary directions.
The first type of approach focused on detecting content elements such as
objects, attributes, activities, or spatial relationships and then
composing captions for images or videos using linguistically inspired
templates. The second type of approach explored methods to make use of
existing text either directly associated with an image or retrieved from
visually similar images.

With the advancement of deep learning for content estimation, there have
been many exciting recent attempts to generate image descriptions using
neural network based approaches. Some methods first detect words or phrases
using Convolutional Neural Network (CNN) features, then generate and
re-rank candidate sentences. Other methods take a more end-to-end approach
to generate output descriptions directly from images. Kiros et al. learn a
joint image-sentence embedding using visual CNNs and Long Short Term Memory
(LSTM) networks. Similarly, several other methods have made use of CNN
features and LSTM or recurrent neural networks (RNN) for generation with a
variety of different architectures. These new methods have shown great
promise for image description generation, under some measures (e.g.,
BLEU-1) achieving near-human performance levels. We look at related, but
more focused description generation tasks.

Description Datasets: Along with the development of image captioning
algorithms there have been a number of datasets collected for this task.
One of the first datasets collected for this problem was the UIUC Pascal
Sentence data set, which contains 1,000 images with 5 sentences per image
written by workers on Amazon Mechanical Turk. As the description problem
gained popularity, larger and richer datasets were collected, including the
Flickr8K [28] and Flickr30K [34] datasets, containing 8,000 and 30,000
images respectively. In an alternative approach, the SBU Captioned Photo
dataset contains 1 million images with existing captions collected from
Flickr. This dataset is larger, but the text tends to contain more
contextual information since captions were written by the photo owners.
Most recently, Microsoft released the MS COCO dataset. MS COCO contains
120,000 images depicting 80 common object classes, with object
segmentations and 5 Turker-written descriptions per image. These datasets
have been one of the driving forces in improving methods for description
generation, but are currently limited to a single description about the
general content of an image. We make use of MS COCO data, extending the
types of descriptions associated with images.

Question-answering: Natural language question-answering has been a
long-standing goal of NLP, with commercial companies like Ask-Jeeves or
Google playing a significant role in developing effective methods.
Recently, embedding and deep learning methods have shown great promise for
question-answering. Lin et al. take an interesting multi-modal approach to
question-answering. A multiple-choice text-based question is first
constructed from 3 sentences written about an image; 2 of the sentences are
used as the question, and 1 is used as the positive answer, mixed with
several negative answers from sentences written about other images. The
authors develop ranking methods to answer these questions and show that
generating abstract images for each potential answer can improve results.
Note that here the algorithms are not provided with an image as part of the
question. Some recent work has started to look at the problem of
question-answering for images. Malinowski et al. [23] combine computer
vision and NLP in a Bayesian framework, but restrict their method to
scene-based questions. Geman et al. design a visual Turing test to test
image understanding using a series of binary questions about image content.
We design more general question-answering tasks that allow us to ask a
variety of different types of natural language questions about images.

3. Designing and collecting Visual Madlibs 

The goal of Visual Madlibs is to study targeted natural language
descriptions of image content that go beyond describing which objects are
in the image, and beyond generic descriptions of the whole image. The
experiments in this paper begin with a dataset of images where the presence
of some objects has already been labeled.¹ The prompts for the
Madlibs-style fill-in-the-blank questions are automatically generated based
on image content, in a manner designed to elicit more detailed descriptions
of the objects, their interactions, and the broader context of the scene
shown in each image.

¹More generally, acquiring such labels could be included as part of
collecting Madlibs.

Type                    | Instruction                                                        | Prompt                                                  | #words
1. image’s scene        | Describe the type of scene/place shown in this picture.            | The place is a(n) ___.                                  | 4+1.45
2. image’s emotion      | Describe the emotional content of this picture.                    | When I look at this picture, I feel ___.                | 8+1.14
3. image’s interesting  | Describe the most interesting or unusual aspect of this picture.   | The most interesting aspect of this picture is ___.     | 8+3.14
4. image’s past         | Describe what happened immediately before this picture was taken.  | One or two seconds before this picture was taken, ___.  | 9+5.45
5. image’s future       | Describe what happened immediately after this picture was taken.   | One or two seconds after this picture was taken, ___.   | 9+5.04
6. object’s attribute   | Describe the appearance of the indicated object.                   | The object(s) is/are ___.                               | 3.20+1.62
7. object’s affordance  | Describe the function of the indicated object.                     | People could ___ the object(s).                         | 4.20+1.74
8. object’s position    | Describe the position of the indicated object.                     | The object(s) is/are ___.                               | 3.20+3.35
9. person’s attribute   | Describe the appearance of the indicated person/people.            | The person/people is/are ___.                           | 3+2.52
10. person’s activity   | Describe the activity of the indicated person/people.              | The person/people is/are ___.                           | 3+2.47
11. person’s location   | Describe the location of the indicated person/people.              | The person/people is/are ___.                           | 3.20+3.04
12. pair’s relationship | Describe the relationship between the indicated person and object. | The person/people is/are ___ the object(s).             | 5.20+1.65

Table 1: All 12 types of Madlibs instructions and prompts. Right-most
column shows the average number of words for each description (#words for
prompt + #words for answer).

Visual Madlibs: Image + Instruction + Prompts + Blank

A single fill-in-the-blank question consists of a prompt and a blank, e.g.,
Person A is [blank] the car. The implicit question is, “What goes in the
blank?” This is presented to a person along with an image and instructions,
e.g., Describe the relationship between the indicated person and object.
The same image and prompt may be used with different instructions to
collect a variety of description types.

Instantiating Questions

While the general form of the questions for the Visual Madlibs was chosen
by hand (see Table 1), most of the questions are instantiated depending on
a subset of the objects present in an image. For instance, if an image
contained two people and a dog, questions about each person (question types
9-11 in Table 1), the dog (types 6-8), and relationships between the two
people and the dog (type 12) could be instantiated. For each possible
instantiation, the wording of the questions might alter slightly to
maintain grammatical consistency. In addition to these types of questions
that depend on the objects present in the image, other questions (types
1-5) can be instantiated for an image regardless of the objects present.
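
The instantiation step described above can be sketched roughly as follows.
This is only an illustration under stated assumptions, not the authors'
code: the function name `instantiate_prompts`, the "Person A"/"dog" labels,
and the singular-only wording are hypothetical, and the prompt strings are
adapted from Table 1.

```python
from itertools import product

# Image-level prompts (question types 1-5); wording adapted from Table 1.
IMAGE_PROMPTS = [
    ("image's scene", "The place is a(n) [blank]."),
    ("image's emotion", "When I look at this picture, I feel [blank]."),
    ("image's interesting", "The most interesting aspect of this picture is [blank]."),
    ("image's past", "One or two seconds before this picture was taken, [blank]."),
    ("image's future", "One or two seconds after this picture was taken, [blank]."),
]

PERSON_TYPES = ("person's attribute", "person's activity", "person's location")


def instantiate_prompts(people, objects):
    """Return (question_type, prompt) pairs for one image.

    `people` and `objects` are the selected instance names (hypothetical
    labels). Plural agreement ("is/are") is ignored in this sketch.
    """
    questions = list(IMAGE_PROMPTS)  # types 1-5 apply to every image
    for person in people:            # types 9-11: same prompt, different instruction
        for qtype in PERSON_TYPES:
            questions.append((qtype, f"{person} is [blank]."))
    for obj in objects:              # types 6-8
        questions.append(("object's attribute", f"The {obj} is [blank]."))
        questions.append(("object's affordance", f"People could [blank] the {obj}."))
        questions.append(("object's position", f"The {obj} is [blank]."))
    for person, obj in product(people, objects):  # type 12
        questions.append(("pair's relationship", f"{person} is [blank] the {obj}."))
    return questions


# Two people and a dog, as in the example in the text.
questions = instantiate_prompts(["Person A", "Person B"], ["dog"])
```

Note that several question types share one prompt and differ only in the
instruction shown to the worker, which is why the type is carried alongside
the prompt string.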

Notice in particular the questions about the temporal context: what might
have happened before or what might happen after the image was taken. People
can make inferences beyond the specific content depicted in an image.
Sometimes these inferences will be consistent between people (e.g., when
what will happen next is obvious), and other times these descriptions may
be less consistent. We can use the variability of returned responses to
select images for which these inferences are reliable.

Asking questions about every object and all pairs of objects quickly
becomes unwieldy as the number of objects increases. To combat this, we
choose a subset of the objects present to use in instantiating questions.
Such selection could be driven by a number of factors. The experiments in
this paper consider comparisons to existing, general descriptions of
images, so we instantiate questions about the objects mentioned in those
existing natural language descriptions. Whether an object is mentioned in
an image description can be viewed as an indication of the object’s
importance.

3.1. Data Collection 

To collect the Visual Madlibs Dataset we use a subset of 10,738
human-centric images from MS COCO, which make up about a quarter of the
validation data, and instantiate fill-in-the-blank templates as described
above. The MS COCO images are annotated with a list of objects present in
the images, segmentations for the locations of those objects, and 5 general
natural language descriptions of the image. To select the subset of images
for collecting Madlibs, we start with the 19,338 images with a person
labeled. We then look at the five descriptions for each and perform a
dependency parse, only keeping those images where a word referring to a
person (woman, man, etc.; e.g., guys and men in Fig. 3) is the head noun
for part of the parse. This leaves 14,150 images. We then filter out the
images whose descriptions do not include a synonym for any of the 79
non-person object categories labeled in the MS COCO dataset. This leaves
10,738 human-centric images with at least one other object from the MS COCO
data set mentioned in the general image descriptions.

Before final instantiation of the fill-in-the-blank templates, we need to
resolve a potential ambiguity regarding which objects are referred to in
the descriptions. There could be several different people or different
instances of an object type labeled in an image. It is not immediately
obvious which ones are described in the sentences. To address this
assignment problem, we estimate the quantity of each described
person/object in the sentence by parsing the determiner (two men and a
frisbee in Fig. 3), the conjunction (a man and a woman), and the
singular/plural form (dog, dogs). We compare this number with the number of
annotated instances for each category, and consider two possible cases: 1)
there are fewer annotated instances than the sentences describe, 2) there
are more annotated instances than the sentences describe. The first case is
easy to address: we just construct templates for all of the labeled
instances. For the second case, we sort the segmented instances by area and
pick the largest ones up to the parsed number for instantiation. Using this
procedure, we obtain 26,148 labeled object or person instances in the
10,738 images.
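
The two cases can be illustrated with a short sketch. It assumes the
parsed count and the per-instance segmentation areas are already available;
the function name `select_instances` and the plain-list input format are
hypothetical, not taken from the MS COCO tooling.

```python
def select_instances(annotated_areas, parsed_count):
    """Choose which annotated instances of one category to instantiate.

    annotated_areas: segmentation area of each annotated instance.
    parsed_count: number of instances the descriptions mention.
    Returns the indices of the chosen instances, in annotation order.
    """
    n = len(annotated_areas)
    if n <= parsed_count:
        # Case 1 (or an exact match): use every labeled instance.
        return list(range(n))
    # Case 2: more annotations than mentions, so keep the largest ones.
    by_area = sorted(range(n), key=lambda i: annotated_areas[i], reverse=True)
    return sorted(by_area[:parsed_count])


# "Two men" mentioned, four people annotated: keep the two largest segments.
chosen = select_instances([50.0, 900.0, 10.0, 700.0], parsed_count=2)
```

Using area as the tie-breaker reflects the heuristic in the text that the
largest segments are the ones most likely to be described.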

Each Visual Madlib is answered by 3 workers on Amazon’s Mechanical Turk. To
date, we have collected 360,001 answers to Madlib questions. Some example
Madlibs answers are shown in Fig. 2.

Labeled instances:

person, person, frisbee, car, car, car

MS COCO descriptions:

1. Two guys are playing Frisbee in the park.

2. Two young men playing a shirtless game of frisbee.

3. Two men in a grassy field playing with a frisbee.

4. Two men are playing with a frisbee together.

5. Two shirtless men playing frisbee in a field.


Figure 3: COCO instance annotation and descriptions for the image of
Fig. 1. We show how we map labeled instances to the mentioned person and
object in the sentence.


4. Tasks: Multiple-choice question answering 
and targeted generation 

We design two tasks to evaluate targeted natural language description for
images. The first task is to automatically generate natural language
descriptions of images to fill in the blank for one of the Madlibs
questions. This allows for producing targeted descriptions such as a
description specifically focused on the appearance of an object, or a
description about the relationship between two objects. The input to this
task is an image, instructions, and a Madlibs prompt. As has been discussed
at length in the community working on description generation for images, it
can be difficult to evaluate free-form generation. Our second task tries to
address this issue by developing a new targeted multiple-choice question
answering task for images. Here the input is again an image, instruction,
and a prompt, but instead of a free-form text answer, there is a fixed set
of multiple-choice answers to fill in the blank. The possible
multiple-choice answers are sampled from the Madlibs responses: one that
was written for the particular image/instruction/prompt as the correct
answer, and distractors chosen from either similar images or random images
depending on the level of difficulty desired. This ability to choose
distractors to adjust the difficulty of the question, as well as the
relative ease of evaluating multiple-choice answers, are attractive aspects
of this new task.


In our experiments we randomly select 20% of the 10,738 images to use as
our test set for evaluating these tasks. For the multiple-choice questions
we form two sets of answers for each, with one set designed to be more
difficult than the other. We first establish the easy task distractor
answers by randomly choosing three descriptions (of the same question type)
from other images [22]. The hard task is designed more delicately. Instead
of randomly choosing from the other images, we now only look for those
containing the same objects as our question image, and then arbitrarily
pick three of their descriptions. Sometimes, the descriptions sampled from
“similar” images could also be good answers for our questions (later we
experiment with using Turkers to select less ambiguous multiple-choice
questions from this set). For the targeted generation task, for question
types 1-5, algorithms generate descriptions given the image, instructions,
and prompt. For the other question types, whose prompts are related to some
specific person or object, we additionally provide the algorithm with the
location of each person/object mentioned in the prompt. We also experiment
with estimating these locations using object detectors.
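
A minimal sketch of the easy and hard distractor sampling is given below.
The record layout (dicts with `image_id`, `qtype`, `objects`, `answer`) and
the function name `sample_distractors` are assumptions made for
illustration, and "containing the same objects" is interpreted here as the
other image's object set including every object of the question image.

```python
import random


def sample_distractors(question, pool, hard, k=3, rng=random):
    """Sample k distractor answers for one multiple-choice question.

    question / pool entries: dicts with keys "image_id", "qtype",
    "objects" (a set of category names) and "answer" (hypothetical layout).
    Raises ValueError if fewer than k candidates are available.
    """
    # Both settings draw answers of the same question type from other images.
    candidates = [
        c for c in pool
        if c["image_id"] != question["image_id"] and c["qtype"] == question["qtype"]
    ]
    if hard:
        # Hard setting: restrict to images that contain the question's objects.
        candidates = [c for c in candidates if question["objects"] <= c["objects"]]
    return [c["answer"] for c in rng.sample(candidates, k)]
```

Passing a seeded `random.Random` instance as `rng` makes the sampled
question sets reproducible.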


5. Analyzing the Visual Madlibs Dataset 


We begin by conducting quantitative analyses of the responses collected in
the Visual Madlibs Dataset in Sec. 5.1. A main goal is understanding what
additional information is provided by the targeted descriptions in the
Visual Madlibs Dataset vs. general image descriptions. The MS COCO dataset
collects general image descriptions following a similar methodology to
previous efforts for collecting general image descriptions. So, we provide
further analyses comparing the Visual Madlibs to the MS COCO descriptions
collected for the same images in Sec. 5.2.

5.1. Quantifying Visual Madlibs responses 


We analyze the length, structure, and consistency of the Visual Madlibs
responses. First, the average length of each type of description is shown
in the far right column of Table 1. Note that descriptions of people tend
to be longer than descriptions of other objects in the dataset.²

Second, we use phrase chunking to analyze which phrasal structures are
commonly used to fill in the blanks for different questions. Fig. 4, top
row, shows relative frequencies for the top-5 most frequent templates used
for several question types. Object attributes are usually described briefly
with a simple adjectival phrase. On the other hand, people use more words
and a wider variety of structure to describe possible future events. Except
for future and past descriptions, the distribution of structures is
generally concentrated on a few likely choices for each question type.

²Also note that the length of the prompts varies slightly depending on the
object names used to instantiate the Madlib, hence the fractional values in
the mean length of the prompts shown in Table 1.
Figure 4: First row shows top-5 most frequent phrase templates for image’s
future, object’s attribute, object’s affordance and person’s activity.
Second row shows the histograms of similarity between answers.


Third, we analyze how consistent the Mechanical Turk workers’ answers are
for each type of question. To compute a measure of similarity between a
pair of responses we use the cosine similarity between representations of
each response. A response is represented by the mean of the Word2Vec [25]
vectors for each word in the response, following [22, 20]. Word2Vec is a
300-dimensional embedding representation for words that encodes the
distributional context of words learned over very large word corpora. This
measure takes into account the actual words used in a response, as opposed
to the previous analyses of parse structure. Each Visual Madlibs question
is answered by three workers, providing 3 pairs for which similarity is
computed. Fig. 4, bottom row, shows a histogram of all pairwise
similarities for several question types. Generally the similarities have a
normal-like distribution with an extra peak around 1 indicating the
fraction of responses that agree almost perfectly. Once again, descriptions
of the future and past are least likely to be (near) identical, while
object attributes and affordances are often very consistent.
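
The consistency measure can be sketched as below. This is an illustration
only: a toy 2-dimensional dictionary stands in for the 300-dimensional
Word2Vec vectors, whitespace tokenization and skipping out-of-vocabulary
words are assumptions, and the function names are hypothetical.

```python
import math
from itertools import combinations


def mean_vector(words, embeddings):
    """Average the vectors of the words that have an embedding."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return None  # no known word in this response
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pairwise_similarities(responses, embeddings):
    """Similarity for every pair of responses (3 pairs for 3 workers)."""
    vecs = [mean_vector(r.lower().split(), embeddings) for r in responses]
    return [
        cosine(u, v)
        for u, v in combinations(vecs, 2)
        if u is not None and v is not None
    ]


# Toy embeddings (assumed values), standing in for Word2Vec.
toy = {"white": [1.0, 0.0], "round": [0.0, 1.0]}
sims = pairwise_similarities(["white", "white", "round"], toy)
```

Two identical answers give a similarity of 1, which corresponds to the
extra peak around 1 in the histograms described above.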

5.2. Visual Madlibs vs general descriptions 

We compare the targeted descriptions in the Visual Madlibs Dataset to the
general image descriptions in MS COCO. First, we analyze the words used in
Visual Madlibs compared to MS COCO descriptions of the same images. For
each image, we extract the unique set of words from all descriptions of
that image from both datasets, and compute the coverage of each set with
respect to the other. We find that on average (across images) 22.45% of the
Madlibs words are also present in MS COCO descriptions, while 52.38% of the
MS COCO words are also present in Madlibs.
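
The per-image coverage computation might look like the following sketch.
The tokenization (lowercasing and splitting on whitespace) is an
assumption, since the text does not specify it, and the function names are
hypothetical; the reported percentages are averages of this quantity over
all images.

```python
def word_set(descriptions):
    """Unique lowercased words over all descriptions of one image."""
    return {word for text in descriptions for word in text.lower().split()}


def coverage(source_words, other_words):
    """Fraction of source_words that also appear in other_words."""
    if not source_words:
        return 0.0
    return len(source_words & other_words) / len(source_words)


# Toy descriptions for a single image (made up for illustration).
madlibs = word_set(["the frisbee is white", "person a is wearing blue shorts"])
coco = word_set(["two men playing frisbee"])

madlibs_in_coco = coverage(madlibs, coco)  # share of Madlibs words found in COCO
coco_in_madlibs = coverage(coco, madlibs)  # share of COCO words found in Madlibs
```

The measure is asymmetric: a larger, more varied word set is harder to
cover, which is consistent with the two percentages reported above.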

Second, we compare how Madlibs and MS COCO answers describe the people and
objects in images. We observe that the Madlibs question types (Table 1)
cover much of the information in MS COCO descriptions [20]. As one way to
see this, we run the StanfordNLP parser³ on both datasets. For attributes
of people, we use the parsing template shown in Fig. 5(a) to analyze the
structures being used. The refer name indicates whether the person was
mentioned in the description. Note that the Madlibs descriptions always
have one reference to a person in the prompt (The person is [blank].).
Therefore, for Madlibs, we report the presence of additional references to
the person (e.g., the person is a man). The general attribute directly
describes the appearance of the person or object (e.g., old or small); the
affiliate object indicates whether additional objects are used to describe
the targeted person (e.g., with a bag, coat, or glasses) and the affiliate
attribute captures appearance characteristics of those secondary objects
(e.g., red coat). The templates for object’s attribute and verbs are more
straightforward, as shown in Fig. 5(b)(c). The table in Fig. 5 shows the
frequency of each parse component. Overall, more of the potential
descriptive elements in these constructions are used in response to the
Madlibs prompts than in the general descriptions found in MS COCO.

a) person's attribute
The person is an old man in red coat.
- refer name: man + person
- general attribute: old
- affiliate attribute: red
- affiliate name: coat

b) object's attribute
The dog is large.
- object attribute: large

c) person's activity/pair's relationship
The person is running.
The person is waiting on the boat.

              | MS COCO | Madlibs
Refer         | 95.?%   | 46.5% (100%)
General att   | 15.5%   | 37.3%
Affiliate att | 1.8%    | 16.3%
Affiliate obj | 7.4%    | 29.6%
Verb          | 79.0%   | 95.4%
Object att    | 18.7%   | 86.8%

Figure 5: Templates used for parsing person’s attributes, activity and
interaction with object, and object’s attribute. The percentages below
compare Madlibs and MS COCO on how frequently these templates are used for
description.

Figure 6: Frequency that a word in a position in the people and object
parsing template in one dataset is in the same position for the other
dataset.

³http://nlp.stanford.edu/software/lex-parser.shtml

We also break down the overlap between Visual Madlibs and MS COCO
descriptions over different parsing templates for descriptions about people
and objects (Fig. 6). Yellow bars show how often words for each parse type
in MS COCO descriptions were also found in the same parse type in the
Visual Madlibs answers, and green bars measure the reverse direction.
Observations indicate that Madlibs provides more coverage in its
descriptions than MS COCO for all templates except for person’s refer name.
One possible reason is that the prompts already indicate “the person” or
“people” explicitly, so workers need not add an additional reference to the
person in their descriptions.

Extrinsic comparison of Visual Madlibs Data and general
descriptions: Here we provide an extrinsic analysis of
the information available in the general descriptions
compared to Visual Madlibs. We perform this analysis by using
either: a) the MS COCO descriptions for an image, or b)
Visual Madlibs responses from other Turkers for an image,
to select answers for our multiple-choice evaluation task.
Specifically, we use one of the human provided descriptions,
either from Madlibs or from MS COCO, and select
the multiple-choice answer that is most similar to that
description. Similarity is measured as cosine similarity
between the mean Word2Vec vector of the words in a
description and the Word2Vec vectors of the multiple-choice
answers. In addition to comparing how well the
Madlibs or MS COCO descriptions can select the correct
multiple-choice answer, we also use the descriptions
automatically produced by a recent natural language generation
system (CNN+LSTM [32], implementation from [?])
trained on the MS COCO dataset. This allows us to make one
possible measurement of how close current automatically
generated image descriptions are to our Madlibs descriptions.
Fig. 7 shows the accuracies resulting from using
Madlibs, MS COCO, or CNN+LSTM [32] to select the correct
multiple-choice answer.
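
The selection rule described above can be sketched in a few
lines. This is a minimal illustration rather than the code used
for the experiments: the three-dimensional toy embeddings, the
whitespace tokenization, and the function names (mean_vector,
cosine, select_answer) are all assumptions made for the example,
and each multiple-choice answer is also represented by the mean
of its word vectors.

```python
import math

def mean_vector(words, embeddings):
    """Average the vectors of the words that have an embedding."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def select_answer(description, choices, embeddings):
    """Return the index of the choice most similar to the reference description."""
    ref = mean_vector(description.lower().split(), embeddings)
    best_index, best_score = 0, -2.0
    for i, choice in enumerate(choices):
        cand = mean_vector(choice.lower().split(), embeddings)
        # Choices (or references) with no known words get the lowest score.
        score = cosine(ref, cand) if ref is not None and cand is not None else -1.0
        if score > best_score:
            best_index, best_score = i, score
    return best_index

# Toy 3-dimensional embeddings, invented for illustration only.
TOY_EMBEDDINGS = {
    "man": [1.0, 0.0, 0.0], "suit": [0.9, 0.1, 0.0], "dark": [0.8, 0.0, 0.2],
    "girl": [0.0, 1.0, 0.0], "jacket": [0.1, 0.9, 0.0], "purple": [0.0, 0.8, 0.2],
}
reference = "a man wearing a dark suit"
choices = ["a girl in a purple jacket", "a man in a dark suit"]
chosen = select_answer(reference, choices, TOY_EMBEDDINGS)
```

With real Word2Vec vectors, only the embedding lookup would
change; the selection logic stays the same.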


Figure 7: The accuracy of Madlibs, MS COCO and
CNN+LSTM [32] (trained on MS COCO) used as references
to answer the Madlibs hard multiple-choice questions.

Although this approach is quite simple, it allows us to
make two interesting observations. First, Madlibs outperforms
MS COCO on all types of multiple-choice questions.
If Madlibs and MS COCO descriptions provided the same
information, we would expect their performance to be
comparable. Presumably the performance increase for Madlibs
is due to the coverage of targeted descriptions compared
to MS COCO's sentences that describe the overall image
content more generally. Second, the automatically generated
descriptions from the pre-trained CNN+LSTM perform
much worse than the actual MS COCO descriptions,
despite doing quite well on general image description
generation (the BLEU-1 score of CNN+LSTM, 0.67, is near
human agreement, 0.69, on MS COCO [32]).

6. Experiments 

In this section we evaluate a series of methods on the
Visual Madlibs Dataset for the targeted natural language
generation and multiple-choice question answering tasks,
introduced in Sec. [?]. As methods, we evaluate simple
joint-embedding methods - canonical correlation analysis (CCA)
and normalized CCA (nCCA) [?] - as well as a recent
deep-learning based method for image description
generation - CNN+LSTM [32]. We train these models on 80%
of the images in the Madlibs collection and evaluate their
performance on the remaining 20%.
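
One way to realize an image-level 80/20 split is sketched below.
The text does not say how the split was drawn, so the hashing
scheme, the function name assign_split, and the integer image
identifiers are assumptions for illustration. Splitting by image
rather than by question keeps every question about a given image
on the same side of the split.

```python
import hashlib

def assign_split(image_id, train_percent=80):
    """Map an image id to 'train' or 'test' using a stable hash of the id."""
    digest = hashlib.sha256(str(image_id).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # pseudo-uniform bucket in [0, 99]
    return "train" if bucket < train_percent else "test"

# Hypothetical image identifiers, used only to exercise the function.
image_ids = list(range(1000))
splits = {image_id: assign_split(image_id) for image_id in image_ids}
train_share = sum(1 for s in splits.values() if s == "train") / len(splits)
```

Because the assignment depends only on the image id, it is
reproducible across runs and does not require storing a list of
training images.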

In our experiments we extract image features using the
VGG Convolutional Neural Network (CNN) [29]. This
model has been trained on the ILSVRC-2012 dataset to
recognize images depicting 1000 object classes, and generates
a 4,096-dimensional image representation. On the sentence
side, we average the Word2Vec vectors of all words in a
sentence to obtain a 300-dimensional representation.

Table 2: Accuracies computed for different approaches on
the easy and hard multiple-choice answering task, and the
filtered hard question set. CCA, nCCA, and CNN+LSTM
are trained on the whole image representation for each
type of question. nCCA(box) is trained and evaluated on
ground-truth bounding-boxes from COCO segmentations.
nCCA(all) trains a single embedding using all question
types.

                     Easy Task                       Hard Task
              #Q     nCCA   nCCA(bbox)  nCCA(dbox)   nCCA   nCCA(bbox)  nCCA(dbox)
6. obj attr   2021   47.6%  53.6%       51.4%        43.9%  47.9%       45.2%
9. per attr   4206   50.2%  55.4%       51.2%        40.0%  47.0%       43.3%

Table 3: Multiple-choice answering using automatic
detection for 42 object/person categories. “bbox” denotes
ground-truth bounding box and “dbox” denotes detected
bounding box.

CCA is an approach for finding a joint embedding
between two multi-dimensional variables, in our case image
and text vector representations. In an attempt to increase the
flexibility of the feature selection and to improve
computational efficiency, Gong et al. [?] proposed a scalable
approximation scheme of explicit kernel mapping followed
by dimension reduction and linear CCA. In the projected
latent space, the similarity is measured by the
eigenvalue-weighted normalized correlation. This method, nCCA,
provides high-quality retrieval results, improving over the
original CCA performance significantly [?].

We train CCA and nCCA models for each question type
separately using the training portion of the Visual Madlibs
Dataset. These models allow us to map from an image
representation, to the joint-embedding space, to vectors in the
Word2Vec space, and vice versa. For targeted generation,
we map an image to the joint-embedding space and then
choose the answer from the training set text that is closest to
this embedded point. In order to answer a multiple-choice
question we embed each multiple-choice answer, and then
select the answer whose embedding is closest to the
embedded image.
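
The multiple-choice step can be illustrated once the image and
the candidate answers have been projected into the joint space.
The sketch below assumes one plausible reading of the
eigenvalue-weighted normalized correlation: cosine similarity
after scaling each joint-space dimension by a power of its
eigenvalue. The exponent (the hypothetical parameter power), the
function names, and the toy vectors are assumptions for this
example and are not taken from the paper.

```python
import math

def weighted_cosine(image_vec, text_vec, eigenvalues, power=1.0):
    """Cosine similarity after scaling each dimension by eigenvalue ** power."""
    weights = [lam ** power for lam in eigenvalues]
    a = [w * x for w, x in zip(weights, image_vec)]
    b = [w * y for w, y in zip(weights, text_vec)]
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return dot / norm if norm > 0.0 else 0.0

def answer_multiple_choice(image_vec, choice_vecs, eigenvalues, power=1.0):
    """Return the index of the answer embedding most similar to the image."""
    scores = [weighted_cosine(image_vec, c, eigenvalues, power) for c in choice_vecs]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy joint-space vectors (already projected), invented for illustration.
eigenvalues = [0.9, 0.5, 0.1]
image_vec = [0.9, 0.1, 0.4]
choice_vecs = [
    [0.1, 0.2, 0.9],
    [0.8, 0.3, 0.0],
    [0.0, 1.0, 0.1],
    [0.5, -0.5, 0.5],
]
predicted = answer_multiple_choice(image_vec, choice_vecs, eigenvalues)
```

Targeted generation follows the same pattern, with the candidate
set replaced by the embedded answers from the training set.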

Following the recent “Show and Tell” description
generation technique [32] (using an implementation from [?]),
we train a CNN+LSTM model for each question type on
the Visual Madlibs training set. This approach has
demonstrated state-of-the-art performance on generating general
natural language descriptions for images. These models
directly learn a mapping from an image to a sequence of
words, which we can use to evaluate the targeted generation
task. Note that we input the words from the prompt,
e.g., The chair is, and then let the CNN+LSTM system
generate the remaining words of the description.^ For the
multiple-choice task, we compute cosine similarity between
Word2Vec representations of the generated description and
each question answer and select the most similar answer.

6.1. Discussion of results 

Table 2 shows accuracies of each algorithm on the easy
and hard versions of the multiple-choice task. Fig. 8 shows
example correct and wrong answer choices. There are several
interesting observations we can make. First, training
nCCA on all types of questions together, labeled as
nCCA(all), is helpful for the easy variant of the task;
however, it is less useful on the “fine-grained” hard version
of the task. Second, extracting visual features from the
bounding box of the relevant person/object yields higher
accuracy for predicting attributes, but not for other
questions. Based on this finding, we try answering the
attribute questions using automatic detection methods. The
detectors are trained on

^The missing entries for questions 7 and 12 are due to this priming 
failing for a fraction of the questions. 
Figure 8: Some example question-answering results from nCCA. First row shows correct choices. Second row shows
incorrect choices.



                           BLEU-1                           BLEU-2
                  nCCA   nCCA(bbox)  CNN+LSTM      nCCA   nCCA(bbox)  CNN+LSTM
1. scene          0.52   -           0.62          0.17   -           0.19
2. emotion        0.17   -           0.39          0      -           0
3. future         0.38   -           0.32          0.12   -           0.08
4. past           0.39   -           0.42          0.12   -           0.11
5. interesting    0.49   -           0.51          0.14   -           0.15
6. obj attr       0.28   0.36        0.45          0.02   0.02        0.01
7. obj aff        0.56   0.60        -             0.10   0.11        -
8. obj pos        0.53   0.55        0.71          0.24   0.25        0.50
9. per attr       0.26   0.29        0.55          0.06   0.07        0.25
10. per act       0.47   0.41        0.52          0.14   0.11        0.22
11. per loc       0.52   0.46        0.64          0.22   0.19        0.39
12. pair rel      0.46   0.48        -             0.07   0.08        -

Table 4: BLEU-1 and BLEU-2 computed on Madlibs testing dataset for different approaches.


ImageNet using R-CNN [?], covering 42 MS COCO
categories. We observe similar performance between
ground-truth and detected bounding boxes in Table 3.

As an additional experiment we ask humans to answer
the multiple-choice task, with 5 Turkers answering each
question. We use their results to filter out a subset of
the hard multiple-choice questions where at least 3 Turkers
choose the correct answer. Results of the methods on
this subset are shown in Table 2, bottom set of rows. These
results show the same pattern as on the unfiltered set, with
slightly higher accuracy.
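
The filtering criterion can be expressed directly. The sketch
below assumes a hypothetical record layout (the keys "id",
"correct", and "turker_answers") and invented responses; only
the threshold of at least 3 correct answers out of 5 comes from
the text.

```python
def filter_hard_questions(questions, min_correct=3):
    """Keep questions that at least `min_correct` Turkers answered correctly."""
    kept = []
    for q in questions:
        n_correct = sum(1 for a in q["turker_answers"] if a == q["correct"])
        if n_correct >= min_correct:
            kept.append(q)
    return kept

# Invented example records: each question has 5 Turker responses.
questions = [
    {"id": "q1", "correct": 2, "turker_answers": [2, 2, 2, 0, 1]},  # 3 of 5 correct
    {"id": "q2", "correct": 0, "turker_answers": [1, 1, 0, 3, 2]},  # 1 of 5 correct
    {"id": "q3", "correct": 1, "turker_answers": [1, 1, 1, 1, 1]},  # 5 of 5 correct
]
filtered = filter_hard_questions(questions)
```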

Table 4 shows BLEU-1 and BLEU-2 scores for targeted
generation. Although the CNN+LSTM models we trained
on Madlibs were not quite as accurate as nCCA for selecting
the correct multiple-choice answer, they did result in better,
sometimes much better, accuracy (as measured by BLEU
scores) for targeted generation.
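
For readers unfamiliar with the metric, a compact corpus-level
BLEU-N computation is sketched below. It follows the standard
definition (clipped n-gram precision, geometric mean over
1..N-grams, brevity penalty); the paper does not state which
BLEU implementation or tokenization was used, so this is an
illustrative assumption, and the example sentences are invented.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(candidates, references_list, max_n):
    """Corpus-level BLEU with uniform weights over 1..max_n grams."""
    matched = [0] * max_n
    total = [0] * max_n
    cand_len = 0
    ref_len = 0
    for cand, refs in zip(candidates, references_list):
        cand_len += len(cand)
        # Brevity penalty uses the reference length closest to the candidate length.
        ref_len += min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
        for n in range(1, max_n + 1):
            cand_counts = ngrams(cand, n)
            max_ref = Counter()
            for r in refs:
                for gram, count in ngrams(r, n).items():
                    max_ref[gram] = max(max_ref[gram], count)
            # Clip each candidate n-gram count by its maximum count in any reference.
            matched[n - 1] += sum(min(c, max_ref[g]) for g, c in cand_counts.items())
            total[n - 1] += sum(cand_counts.values())
    if cand_len == 0 or any(t == 0 for t in total) or any(m == 0 for m in matched):
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    brevity = 1.0 if cand_len > ref_len else math.exp(1.0 - ref_len / cand_len)
    return brevity * math.exp(log_precision)

# Invented example: one generated answer scored against two human answers.
candidates = ["in the kitchen".split()]
references_list = [["in a kitchen".split(), "inside the kitchen".split()]]
bleu1 = corpus_bleu(candidates, references_list, max_n=1)
bleu2 = corpus_bleu(candidates, references_list, max_n=2)
```

In this example every unigram of the candidate appears in some
reference, but only one of its two bigrams does, so BLEU-2 is
lower than BLEU-1. Short answers with no matching bigram score
zero on BLEU-2.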


7. Conclusions 

We have introduced a new fill-in-the-blank strategy for
collecting targeted natural language descriptions and used
this to collect a Visual Madlibs dataset. Our analyses show
that these descriptions are usually more detailed than generic
whole image descriptions. We also introduce a targeted
natural language description generation task, and a
multiple-choice question answering task, then train and
evaluate joint-embedding and generation models. Data produced
by this paper will be publicly released upon acceptance.